Image CC BY-SA:
bertogg @ flickr
Clipart: openclipart.org
Presentation: (C) Henrik Ingo, 2017. Please share and reuse as explained in the CreativeCommons Attribution License.
English: lifeboat, floating device
Academic: Leader based distributed consensus algorithm
John Osterhout, Diego Ongaro (USENIX 2014)
In practice: Single-master synchronous replication protocol
Similar to Vievstamp replication. Alternative to Paxos.
Easy to understand
Easy to discuss trade offs
Complete and practical
(Actually works if implemented)
Raft replicates a log. The state machine (aka database) is secondary.
At any given time, (at most) one node is leader.
Leaders are chosen with elections.
Time between elections is called term.
When observing new term, servers must immediately adopt it.
index | term | command |
1 | 1 | a <- 5 |
2 | 1 | a <- 7 |
3 | 1 | b <- 1 |
4 | 2 | ... |
5 | 2 | ... |
6 | 4 | ... |
7 | 4 | ... |
8 | 4 | ... |
9 | 4 | ... |
|
|
Adopt term concept
Replication counts as heartbeat
Highest priority node likely,
but not guaranteed to win
Freshness: Only majority commit guaranteed
Simplify: Vote for first qualified candidate
No veto!
> db.oplog.rs.find().sort( { $natural : -1 } ).limit(1).pretty()
{ "ts" : Timestamp(1444466011, 1),
"h" : NumberLong("-6240522391332325619"),
"v" : 2,
"op" : "i",
"ns" : "test.nulltest",
"o" : { "_id" : 3, "a" : 3 }
}
unix time + counter
hash = gtid
> db.oplog.rs.find().sort( { $natural : -1 } ).limit(1).pretty()
{ "ts" : Timestamp(1445632081, 861),
"t" : NumberLong(42),
"h" : NumberLong("5466055178864103715"),
"v" : 2,
"op" : "u",
"ns" : "ycsb.usertable",
"o2" : { "_id" : "user645414720104877157" },
"o" : { "$set" : { "field4" : BinData(0,"KT5sN1wrPl...Ny9wIg==") }
}
same, used as Raft index
Raft term
random integer = gtid
A young woman trying on earrings.
Rembrandt
CC-BY jankruithof @ Flickr